Peer-to-peer (P2P) lending was a phenomenon less than ten years ago, exploding in popularity by offering a break from traditional banking. Individuals flocked to these alternative credit markets to finance small business ventures and home purchases and to consolidate debt. Although direct P2P lending has undergone changes over recent years, it remains a viable option for borrowers and investors.
The global P2P lending market is anticipated to grow from $84 billion in 2021 to $706 billion by 2030, according to figures from Precedence Research.[1] This analysis takes a closer look at the mechanics of P2P lending to better understand the considerations that factor into decisions to apply for, issue, and provide financing via P2P platforms. Our analysis and statistical testing identify credit underwriting policy and borrowers’ failure to fully pay as variables of interest that should be considered in future in-depth analyses.
P2P lending is the provision of financing without a traditional bank as the source of funds; it is, like it sounds, peers lending money to their peers. Instead of banks, online lending platforms provide a service that connects willing lenders, or investors, with individuals seeking to borrow funds. Historically, these investors have been predominantly private individuals seeking alternate forms of investments, wherein they receive the interest earned on the money they loan out.
Borrowers, on the other hand, are connected to feasible funding that they might not otherwise have been able to attain. Many borrowers participating in P2P lending did or would have experienced difficulties qualifying for traditional loans from banks. Lenders’ perception of higher risk can often translate into higher interest rates. P2P platforms screen borrowers and set rates and terms, but it is ultimately up to the lender whether to provide the funds.
The P2P market was dominated by LendingClub during the early rise of P2P lending, and it remains a leader in the industry. It makes money by charging borrowers an origination fee, charging investors a service fee, and selling loans in secondary markets. LendingClub’s typical annual percentage rate (APR) is between 5.99% and 35.89% while the origination fee of 1% to 6% is taken off the top of loans. Borrowers on LendingClub typically have good-to-excellent credit (700 or higher credit score) and a low debt-to-income ratio.
Our exploratory data analysis will closely adhere to the 9-step checklist presented in Chapter 4 of The Art of Data Science.[2]
Our dataset contains over 9,500 observations of loan data from LendingClub covering 2007 to 2015. We obtained the dataset from Kaggle here: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis
Our work is stored on our team GitHub here: https://github.com/jschild01/JMB_DATS_6101
Below are the variables in the dataset and their accompanying definitions as supplied by Kaggle:
| Variable | Definition |
|---|---|
| credit.policy | 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise. |
| purpose | The purpose of the loan (takes values credit_card, debt_consolidation, educational, major_purchase, small_business, and all_other). |
| int.rate | The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates. |
| installment | The monthly installments owed by the borrower if the loan is funded. |
| log.annual.inc | The natural log of the self-reported annual income of the borrower. |
| dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
| fico | The FICO credit score of the borrower. |
| days.with.cr.line | The number of days the borrower has had a credit line. |
| revol.bal | The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). |
| revol.util | The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available). |
| inq.last.6mths | The borrower’s number of inquiries by creditors in the last 6 months. |
| delinq.2yrs | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
| pub.rec | The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments). |
| not.fully.paid | Whether the borrower failed to fully pay back the loan (1 if not fully paid, and 0 otherwise). |
Our analysis will explore things such as debt-to-income ratios, credit scores, interest rates, and delinquencies among direct P2P borrowers in an attempt to understand the risks and opportunities associated with P2P lending. Specifically, we intend to examine the impact that these variables have on who received loans and who defaulted on their loans between 2007 and 2015.
We will seek to answer a few of the following questions:
For our analysis we will use the ezids, tidyverse, corrplot, scales, gridExtra, expss, knitr, kableExtra, broom, and purrr libraries. Our dataset contains 9,578 rows of data with 14 columns and is structured like this:
| Name | Class | Length | Frequency |
|---|---|---|---|
| credit.policy | numeric | 9578 | 1=7710, 0=1868 |
| purpose | character | 9578 | debt_consolidation=3957, all_other=2331, credit_card=1262, home_improvement=629, small_business=619, major_purchase=437, educational=343 |
| int.rate | numeric | 9578 | 0.1253=354, 0.0894=299, 0.1183=243, 0.1218=215, 0.0963=210, 0.1114=206, 0.08=198, 0.1287=197, 0.1148=193, 0.0859=187, Other values=7276 |
| installment | numeric | 9578 | 317.72=41, 316.11=34, 319.47=29, 381.26=27, 662.68=27, 156.1=24, 320.95=24, 188.02=23, 334.67=23, 669.33=23, Other values=9303 |
| log.annual.inc | numeric | 9578 | 11.00209984=308, 10.81977828=248, 10.30895266=224, 10.59663473=224, 10.71441777=221, 11.22524339=196, 11.15625052=165, 10.77895629=149, 10.91508846=147, 11.08214255=146, Other values=7550 |
| dti | numeric | 9578 | 0=89, 10=19, 0.6=16, 6=13, 12=13, 13.16=13, 15.1=13, 19.2=13, 8.21=12, 10.8=12, Other values=9365 |
| fico | numeric | 9578 | 687=548, 682=536, 692=498, 697=476, 702=472, 707=444, 667=438, 677=427, 717=424, 662=414, Other values=4901 |
| days.with.cr.line | numeric | 9578 | 3660=50, 3630=48, 3990=46, 4410=44, 3600=41, 2550=38, 4080=38, 1800=37, 3690=37, 4020=35, Other values=9164 |
| revol.bal | numeric | 9578 | 0=321, 255=10, 298=10, 682=9, 346=8, 182=6, 1085=6, 2229=6, 1=5, 6=5, Other values=9192 |
| revol.util | numeric | 9578 | 0=297, 0.5=26, 0.3=22, 47.8=22, 73.7=22, 0.1=21, 3.3=21, 0.2=20, 0.7=20, 1=20, Other values=9087 |
| inq.last.6mths | numeric | 9578 | 0=3637, 1=2462, 2=1384, 3=864, 4=475, 5=278, 6=165, 7=100, 8=72, 9=47, Other values=94 |
| delinq.2yrs | numeric | 9578 | 0=8458, 1=832, 2=192, 3=65, 4=19, 5=6, 6=2, 7=1, 8=1, 11=1, Other values=1 |
| pub.rec | numeric | 9578 | 0=9019, 1=533, 2=19, 3=5, 4=1, 5=1 |
| not.fully.paid | numeric | 9578 | 0=8045, 1=1533 |
By examining the structure of our data, we can see that there is only one character variable, which looks like a factor, and that some of the numeric variables look like logicals.
Here we can see the top and bottom rows of our dataset to get a better feel for the data. This will help us better understand the values in our dataset and how to most effectively deal with them.
| credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | debt_consolidation | 0.1189 | 829.10 | 11.3504 | 19.48 | 737 | 5639.958 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | credit_card | 0.1071 | 228.22 | 11.0821 | 14.29 | 707 | 2760.000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 1 | debt_consolidation | 0.1357 | 366.86 | 10.3735 | 11.63 | 682 | 4710.000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 1 | debt_consolidation | 0.1008 | 162.34 | 11.3504 | 8.10 | 712 | 2699.958 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 1 | credit_card | 0.1426 | 102.92 | 11.2997 | 14.97 | 667 | 4066.000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
| credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all_other | 0.1461 | 344.76 | 12.1808 | 10.39 | 672 | 10474.000 | 215372 | 82.1 | 2 | 0 | 0 | 1 |
| 0 | all_other | 0.1253 | 257.70 | 11.1419 | 0.21 | 722 | 4380.000 | 184 | 1.1 | 5 | 0 | 0 | 1 |
| 0 | debt_consolidation | 0.1071 | 97.81 | 10.5966 | 13.09 | 687 | 3450.042 | 10036 | 82.9 | 8 | 0 | 0 | 1 |
| 0 | home_improvement | 0.1600 | 351.58 | 10.8198 | 19.18 | 692 | 1800.000 | 0 | 3.2 | 5 | 0 | 0 | 1 |
| 0 | debt_consolidation | 0.1392 | 853.43 | 11.2645 | 16.28 | 732 | 4740.000 | 37879 | 57.0 | 6 | 0 | 0 | 1 |
The top and bottom rows of our dataset indicate the data is structured in an acceptable way and that our variables match up with the values for each column.
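Because log.annual.inc stores the natural log of income, exponentiating a value recovers the approximate dollar amount, which can make rows like those above easier to read. The following is a minimal Python sketch rather than the code behind this report; it uses three log.annual.inc values copied from the top rows shown, and the variable names are illustrative only.

```python
import math

# log.annual.inc values copied from the first three rows shown above
log_incomes = [11.3504, 11.0821, 10.3735]

# exp() inverts the natural log, giving approximate annual income in dollars
incomes = [math.exp(v) for v in log_incomes]

for log_value, income in zip(log_incomes, incomes):
    print(f"log.annual.inc = {log_value:.4f} -> about ${income:,.0f} per year")
```

The first value, for example, corresponds to an annual income of roughly $85,000.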
According to the Kaggle page for this dataset, there are 9,578 rows and 14 columns, which matches what we have. The page also indicates that there is no missing data. We can verify this by counting the missing cells in the dataset, which gives 0, and the null cells, which also gives 0. We can also check whether the observations are unique, and we see that all 9,578 rows are unique.
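The completeness and uniqueness checks described above could be sketched as follows. This is a plain-Python illustration, not the code used for the report; the three rows are taken from the top of the dataset and trimmed to four columns, and the helper names `count_missing` and `count_duplicates` are hypothetical.

```python
# Three rows from the top of the dataset, trimmed to four columns for brevity
rows = [
    {"credit.policy": 1, "purpose": "debt_consolidation", "int.rate": 0.1189, "fico": 737},
    {"credit.policy": 1, "purpose": "credit_card", "int.rate": 0.1071, "fico": 707},
    {"credit.policy": 1, "purpose": "debt_consolidation", "int.rate": 0.1357, "fico": 682},
]

def count_missing(records):
    """Count cells that are None or an empty string (hypothetical helper)."""
    return sum(1 for r in records for v in r.values() if v is None or v == "")

def count_duplicates(records):
    """Count rows that exactly repeat an earlier row (hypothetical helper)."""
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

print("missing cells:", count_missing(rows))
print("duplicate rows:", count_duplicates(rows))
```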
This means the data looks good so far and we can now move on to the descriptive statistics part of our EDA.
| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0.000 | Length: 9578 | 0.0600 | 15.67 | 7.548 | 0.000 | 612.0 | 179 | 0 | 0.0 | 0.000 | 0.0000 | 0.00000 | 0.0000 |
| Q1 | 1.000 | Class: character | 0.1039 | 163.77 | 10.558 | 7.213 | 682.0 | 2820 | 3187 | 22.6 | 0.000 | 0.0000 | 0.00000 | 0.0000 |
| Median | 1.000 | Mode: character | 0.1221 | 268.95 | 10.929 | 12.665 | 707.0 | 4140 | 8596 | 46.3 | 1.000 | 0.0000 | 0.00000 | 0.0000 |
| Mean | 0.805 | NA | 0.1226 | 319.09 | 10.932 | 12.607 | 710.8 | 4561 | 16914 | 46.8 | 1.577 | 0.1637 | 0.06212 | 0.1601 |
| Q3 | 1.000 | NA | 0.1407 | 432.76 | 11.291 | 17.950 | 737.0 | 5730 | 18250 | 70.9 | 2.000 | 0.0000 | 0.00000 | 0.0000 |
| Max | 1.000 | NA | 0.2164 | 940.14 | 14.528 | 29.960 | 827.0 | 17640 | 1207359 | 119.0 | 33.000 | 13.0000 | 5.00000 | 1.0000 |
We have an idea of what to expect for a few variables, such as interest rate and credit score, so we were able to test the dataset against some of our expectations to gauge its reliability. By inspecting the summary table, we can see that interest rates in the data are between 6% and 21.64% and credit scores range from 612 to 827. Although these interest rates might seem excessively high, or the credit scores too low, the P2P market tended to consist of riskier loans. This aligns with our expectations and reinforces our confidence in the dataset.
The range of the utilization, or the percent of credit being used, is between 0% and 119%. Someone utilizing more than 100% of the credit available to them initially seemed erroneous; however, this can result from technical error, creditors and collectors reporting at different dates and times, borrowers opening and closing credit lines, or possibly borrowers appearing as authorized users of others’ credit lines. Regardless of the reason, only 27 loans within our dataset appear to exceed the standard maximum of 100%, so we do not expect this to have a significant effect on our analysis, and we can continue with our EDA.
We want to see a measure of dispersion/variation, namely standard deviation, for the numeric variables. The values are as follows:
| Variable | Standard Deviation |
|---|---|
| credit.policy | 0.40 |
| int.rate | 0.03 |
| installment | 207.07 |
| log.annual.inc | 0.61 |
| dti | 6.88 |
| fico | 37.97 |
| days.with.cr.line | 2496.93 |
| revol.bal | 33756.19 |
| revol.util | 29.01 |
| inq.last.6mths | 2.20 |
| delinq.2yrs | 0.55 |
| pub.rec | 0.26 |
| not.fully.paid | 0.37 |
Here, we have used standard deviation as a measure of dispersion to understand the spread of each variable. While variables like credit.policy and not.fully.paid can only take one of two values, 0 and 1, the variables revol.bal, days.with.cr.line, and installment have the highest standard deviations, in that order.
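As a worked illustration of the dispersion measure used above, the sketch below computes the sample standard deviation (dividing by n - 1) for the ten fico values visible in the top and bottom rows shown earlier. It is written in plain Python for illustration and is not the code behind the table; with only ten rows the result will not match the full-dataset value of 37.97, and `sample_sd` is a hypothetical helper name.

```python
import math
import statistics

# fico values from the ten top and bottom rows shown earlier
fico = [737, 707, 682, 712, 667, 672, 722, 687, 692, 732]

def sample_sd(values):
    """Sample standard deviation: square root of squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - 1))

# The hand-rolled version should agree with the standard library
print(round(sample_sd(fico), 2), round(statistics.stdev(fico), 2))
```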
We are satisfied with the data so far, so the next step is to begin visualizing it.
Below is a histogram for each non-logical numeric variable to help us
understand how the data is distributed:
One of the primary things we are looking for is normality, which we can see in the log.annual.inc variable. Some other variables look at least somewhat normal, such as days.with.cr.line, fico, installment, and int.rate. We will make Q-Q plots later to get a better sense of the normality of these variables.
The revol.util variable is fairly flat, and dti is a little more rounded, but neither looks normal. There are four variables where we clearly have outlier issues: delinq.2yrs, inq.last.6mths, pub.rec, and revol.bal. For those we can see one or a small number of large bars on the left, and then a fairly flat graph near zero after that.
To help get a better look at the outliers, below are boxplots for the
same variables:
Setting outlier.alpha to 0.2 to compensate for overplotting, we can now get a better understanding of the four variables with outlier issues. For delinq.2yrs and pub.rec, nearly all of the observations are 0, with a small number of outliers at integer values above zero. It looks like these two variables will not be very useful to us.
The inq.last.6mths variable does have some range, with most observations being 0, 1, or 2. We can consider taking out the outliers later on, and hopefully this variable will prove useful.
The revol.bal variable actually has a good range, but the outliers are so far out that it is difficult to see. For this variable we’ll want to consider removing outliers and possibly transforming the data by taking its natural log, as was already done for annual income in the log.annual.inc variable.
We can take a look at the factor and logical variables with bar
charts:
From this we can see that the purpose variable would be a good candidate for ANOVA tests. For each value of these variables there are at least a few hundred observations, so we should have a large enough sample size for further analysis and statistical tests. At this point we can comfortably convert credit.policy and not.fully.paid to logicals, and purpose to a factor.
Now we can further explore the data to see how the numeric variables differ based on the credit.policy, not.fully.paid, and purpose variables. Let’s make some boxplots to visualize this.
Here we look at the numeric variables for individual sub-categories
of the logical and categorical variables. This helps us comprehensively
understand how the numerical variables are distributed with respect to
each of the logical and categorical values.
For credit.policy, a few numeric variables look fairly different depending on its value: fico, inq.last.6mths, and int.rate.
For not.fully.paid, no variables visually stand out to a significant degree.
From this we can see some of the purpose categories
stand out for certain variables. The debt_consolidation and
credit_card purposes stand out for dti and
revol.util, while small_business stands out
for installment and int.rate. We will perform
one-way ANOVA tests later to confirm what we can see here.
The correlation plot below gives an overview of how each of the
variables in the dataset may relate to each other. This plot includes
every variable except the purpose variable.
This allows us to quantify some of the stand-outs in the additional
boxplots such as int.rate, fico, and
inq.last.6mths compared to credit.policy.
There are also four other correlations that are either greater than 0.4
or less than -0.4 that we want to explore further.
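To make the ±0.4 threshold concrete, here is a small sketch of the Pearson correlation coefficient computed by hand. It is a plain-Python illustration, not the code behind the correlation plot; it uses the fico and int.rate values from the ten top and bottom rows shown earlier, which are far too few to reproduce the full-dataset correlations, and `pearson_r` is a hypothetical helper name.

```python
import math

# fico and int.rate values from the ten top and bottom rows shown earlier
fico = [737, 707, 682, 712, 667, 672, 722, 687, 692, 732]
int_rate = [0.1189, 0.1071, 0.1357, 0.1008, 0.1426,
            0.1461, 0.1253, 0.1071, 0.1600, 0.1392]

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)

r = pearson_r(fico, int_rate)
# Flag the pair if it crosses the threshold used in this report
print(f"r = {r:.3f}, beyond +/-0.4: {abs(r) > 0.4}")
```

On these ten rows the coefficient is negative, in line with the expectation that higher FICO scores go with lower interest rates.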
Based on the four variable correlations we have not looked at yet
that are greater than 0.4 or less than -0.4, these scatter plots allow
us to get a better understanding of those correlations.
Using a point alpha of 0.2 to compensate for overplotting, we can see a clear trend of interest rates falling as the borrower’s FICO score goes up, which matches our expectations. All of the FICO scores end in 2 or 7, which is why the results fall into vertical lines.
Here we can see that as the borrower’s interest rate climbs, so does the revolving line utilization rate. This makes sense because we would expect a higher interest rate to be associated with a more risky loan. If the loan is riskier the borrower probably has more difficulty getting credit, and therefore would make use of a higher percentage of the credit they do have available.
As the natural log of the borrower’s annual income increases, we see that the installment of their loan does as well. This also matches our expectations, as those who make more money would likely be able to make higher payments.
We see that as the borrower’s FICO score goes up, their revolving line utilization rate decreases, related to the explanations from the above scatter plots. In the end, this also matches our expectations.
Before moving into more advanced statistical tests we want to take an
initial look at loans that meet the credit underwriting criteria vs the
borrower not fully paying. Based on the credit.policy and
not.fully.paid variables, we can calculate the percentage
of borrowers who did not fully pay based on if they met the credit
underwriting criteria.
| Meets Credit Policy | Percent Not Fully Paid |
|---|---|
| FALSE | 27.8% |
| TRUE | 13.2% |
From this we can see that about 13.2% of borrowers who met the credit underwriting criteria did not fully pay, while for the borrowers who did not meet the credit underwriting criteria about 27.8% did not fully pay.
This indicates borrowers who did not meet the credit underwriting criteria were about twice as likely to default on their loans as those who did meet the criteria. For comparison, default rates on loans from commercial banks for the same period as our dataset averaged 4.48%, with a maximum of 7.49% towards the end of 2009, according to the St. Louis Federal Reserve Bank.[3]
We can confirm that these loans were definitely riskier, especially so if they did not meet the credit underwriting criteria. Based on this a potential lender would be wise to give serious consideration to whether or not the potential borrower meets the credit underwriting policy.
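The grouped percentage and the "about twice as likely" comparison above can be sketched as follows. This is a plain-Python illustration, not the report's own code: the `records` list is made-up toy data (not rows from the dataset), `not_paid_rate` is a hypothetical helper, and only the two percentages at the end come from the table above.

```python
from collections import defaultdict

# Toy (meets_policy, not_fully_paid) records: made up, NOT from the dataset
records = [
    (True, 0), (True, 0), (True, 0), (True, 1),
    (False, 0), (False, 1), (False, 0), (False, 1),
]

def not_paid_rate(records):
    """Share of loans not fully paid within each credit-policy group."""
    totals = defaultdict(int)
    unpaid = defaultdict(int)
    for meets_policy, not_fully_paid in records:
        totals[meets_policy] += 1
        unpaid[meets_policy] += not_fully_paid
    return {group: unpaid[group] / totals[group] for group in totals}

rates = not_paid_rate(records)
print(rates)

# Ratio of the two percentages reported in the table above
reported_ratio = 27.8 / 13.2
print(round(reported_ratio, 2))
```

The ratio of the reported rates is about 2.1, which is where "about twice as likely" comes from.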
We produced Q-Q plots and performed ANOVA and chi-squared tests to examine the variables in our dataset and gain a better understanding of how they interact with each other.
Since our dataset is a subset of all loans facilitated by LendingClub between the years 2007 and 2015, we can treat it as a sample and conduct t-tests which confirm that the means of all the variables of the sample coincide with the means of our population. However, we cannot perform z-tests on this data because we do not know the standard deviation of the population (all loans facilitated by LendingClub).[4][5][6][7][8]
We want to create a Q-Q plot for each numeric variable so we can assess the normality of each. This reinforces what we saw in the histograms during our EDA.
From the plots we can see that int.rate, log.annual.inc, and fico are the three variables closest to a normal distribution.
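For readers unfamiliar with how a Q-Q plot is built, the sketch below pairs sorted sample values with the quantiles a standard normal distribution would produce at the same positions; points that fall close to a straight line suggest approximate normality. It is a plain-Python illustration, not the code behind the plots. The sample is the ten log.annual.inc values from the rows shown earlier, the plotting positions (i - 0.5) / n are a simple convention assumed here, and `qq_points` is a hypothetical helper name.

```python
from statistics import NormalDist

# log.annual.inc values from the ten top and bottom rows shown earlier
log_income = [11.3504, 11.0821, 10.3735, 11.3504, 11.2997,
              12.1808, 11.1419, 10.5966, 10.8198, 11.2645]

def qq_points(values):
    """Pair each sorted sample value with its theoretical normal quantile."""
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

points = qq_points(log_income)
for theoretical_q, sample_q in points:
    print(f"{theoretical_q:7.3f}  {sample_q:.4f}")
```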
ANOVA indicates that there is a significant difference in the means
of all of our variables for the different categories of
purpose, except for one. Subsequent Tukey tests confirm
these differences and validate our ANOVA tests. The ANOVA test for the
one exception — delinquencies in the past two years — indicates there is
no significant difference in its mean for the different categories of
purpose. Similarly, subsequent Tukey tests confirm this
lack of a difference and validate the ANOVA test.
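The F statistic behind a one-way ANOVA compares the variation between group means with the variation within groups. The sketch below shows that calculation in plain Python; it is not the code used for the tests reported here, the `toy_groups` numbers are made up for illustration (they are not values from the dataset), and `one_way_anova_f` is a hypothetical helper name.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a dict of group samples."""
    all_values = [v for sample in groups.values() for v in sample]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their group mean
    for sample in groups.values():
        group_mean = sum(sample) / len(sample)
        ss_between += len(sample) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in sample)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up toy samples for three purpose categories (NOT dataset values)
toy_groups = {
    "credit_card": [1, 2, 3],
    "educational": [2, 3, 4],
    "small_business": [5, 6, 7],
}

f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f}")
```

A large F relative to the F distribution with the corresponding degrees of freedom leads to a small p-value, which is the basis for the "significant difference in means" conclusions above.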
Chi-square tests indicate that three pairs of variables are associated with each other. The p-values for purpose and credit.policy, purpose and not.fully.paid, and credit.policy and not.fully.paid are less than our chosen significance level of α = 0.05, and therefore we reject the null hypothesis of independence for each pair. This indicates there are statistically significant associations between these variables.
Chi-square test for purpose vs
credit.policy:
Chi-square test for purpose vs
not.fully.paid:
Chi-square test for credit.policy vs
not.fully.paid:
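As a rough illustration of what a chi-square test of independence computes, the sketch below derives the expected counts from the row and column totals and sums the scaled squared differences. It is plain Python, not the code behind the tests above; the 2x2 `toy_table` is made up (it is not the dataset's contingency table), `chi_square_statistic` is a hypothetical helper, and no continuity correction is applied, so results for 2x2 tables can differ from software that applies one by default.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Made-up 2x2 table of counts (NOT from the dataset)
toy_table = [[30, 10],
             [20, 40]]

stat, dof = chi_square_statistic(toy_table)

# Critical value of the chi-square distribution for 1 degree of freedom at alpha = 0.05
CRITICAL_DF1_ALPHA_05 = 3.841
print(f"chi-square = {stat:.3f}, df = {dof}, reject independence: {stat > CRITICAL_DF1_ALPHA_05}")
```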
From our analysis of the dataset we find that:
- There is a relationship between credit.policy and other numeric variables such as int.rate, fico, and inq.last.6mths.
- There is little apparent relationship between not.fully.paid and other numeric variables (except for credit.policy).
- There is an association between purpose, credit.policy, and not.fully.paid.
- For all numeric variables except delinq.2yrs, their mean significantly varies for different categories of purpose.

Private individuals historically made up the bulk of lenders in P2P markets. However, high interest rates and the prospect of risky borrowers undermined P2P lending as a legitimate financial industry. Combined with the urge for more growth by intermediaries like LendingClub, these concerns began to prompt higher lending standards and discussions about more regulation. By 2017, larger institutions and banks began to overtake private individuals as the primary sources of lending in P2P markets. We assume this shift in P2P lenders altered the makeup of who receives loans.
Furthermore, this dataset covers years of very different economic environments. For example, it contains data points from prior to, during, and after the 2008 financial crisis. Since we do not know the exact year of the individual loans we cannot take the time period into consideration, nor can we do time-series analysis to see how it affects the variables.
While the annual income data was given to us as a natural log, the revolving balance was given to us unmodified. We discovered that taking the log of the revol.bal variable gives a better result, with a more normal distribution that renders the variable more usable. However, some loans have a revol.bal value of 0, which returns -Inf when the natural log is taken. We will need to revisit this in the future. For now, we want to demonstrate the results of taking the log of revol.bal and how it increases the readability of the data.
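One common way to handle the zero balances is to take log(1 + x) instead of log(x), which maps a balance of 0 to 0 and leaves large balances almost unchanged. The sketch below shows this in plain Python on the revol.bal values from the ten rows shown earlier. This is an assumption offered for illustration, not the fix chosen in this report; note also that Python's `math.log(0)` raises an error rather than returning -Inf.

```python
import math

# revol.bal values from the ten top and bottom rows shown earlier
revol_bal = [28854, 33623, 3511, 33667, 4740, 215372, 184, 10036, 0, 37879]

# A plain natural log fails on the zero balance
try:
    math.log(0)
except ValueError:
    print("log(0) is undefined, so zero balances need special handling")

# log1p(x) computes log(1 + x), so a zero balance becomes 0.0
log_revol_bal = [math.log1p(balance) for balance in revol_bal]

for balance, transformed in zip(revol_bal, log_revol_bal):
    print(f"{balance:>7} -> {transformed:.3f}")
```

Other options, such as dropping or separately flagging the zero-balance loans, would change the sample, so the choice depends on how the variable is used later.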
1. “Peer to Peer (P2P) Lending Market Size, Report 2022-2030.” Peer to Peer (P2P) Lending Market Size, Report 2022-2030, www.precedenceresearch.com/peer-to-peer-lending-market. Accessed 4 Nov. 2022.
2. Peng, R. D., & Matsui, E. (2016). The Art of Data Science: A Guide for Anyone Who Works with Data. Skybrude Consulting LLC.
3. “Delinquency Rate on All Loans, All Commercial Banks.” Delinquency Rate on All Loans, All Commercial Banks (DRALACBN) | FRED | St. Louis Fed, 22 Aug. 2022, fred.stlouisfed.org/series/DRALACBN.
4. End to end case study (classification): Lending Club Data. (n.d.). Retrieved November 3, 2022, from https://towardsdatascience.com/end-to-end-case-study-classification-lending-club-data-489f8a1b100a
5. Yiu, T. (2019, June 19). Turning Lending Club’s Worst Loans into Investment Gold. Medium. https://towardsdatascience.com/turning-lending-clubs-worst-loans-into-investment-gold-475ec97f58ee
6. Lending Club Review: How it Works, Requirements and Alternatives. (n.d.). Debt.org. https://www.debt.org/credit/loans/personal/lending-club-review/
7. Project 1: Analysis of Lending Club’s data. (n.d.). Data Science Blog. Retrieved November 3, 2022, from https://nycdatascience.com/blog/student-works/project-1-analysis-of-lending-clubs-data/
8. Ph.D, M. K. (2019, April 9). LendingClub: bias in data? Machine learning and investment strategy. Retrieved November 3, 2022, from Medium website: https://michel-kana.medium.com/lendingclub-bias-in-data-machine-learning-and-investment-strategy-3a3bd1c65f0